Similarity Computation in Novelty Detection and Biomedical Text Categorization

Authors

Ming-Feng Tsai

Ming-Hung Hsu

Hsin-Hsi Chen

Abstract

The novelty track was first introduced in TREC 2002. Given a TREC topic, the goal of the 2004 task is to locate relevant and new information in a set of documents. From the results in TREC 2002 and 2003, we realized that the major challenge in recognizing relevant sentences is the lack of information available for computing similarity among sentences. This year, we used variants of a method that employs an information retrieval (IR) system to find relevant and novel sentences. This methodology, called IR with reference corpus, can also be regarded as an information expansion of sentences. A sentence is treated as a query to a reference corpus, and the similarity between sentences is measured in terms of the weighting vectors of the document lists ranked by the IR system. Relevant sentences are thus extracted by comparing their retrieval results on a given IR system: two sentences are regarded as similar if the document lists returned for them by the IR system are similar. In the novelty part, we used a similar approach to extract novel sentences from the sentences of the relevant part. We also present an effective dynamic threshold setting approach based on the percentage of relevant sentences within a relevant document. In this paper, we focus on three points: first, how to use the results of an IR system to compare the similarity between sentences; second, how to filter out redundant sentences; third, how to determine appropriate relevance and novelty thresholds.
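The idea of comparing sentences through their retrieval results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy reference corpus, the simple TF-IDF-style scoring, the `top_k` cutoff, and the fixed `threshold` are all assumptions made for illustration, and the dynamic threshold setting described in the abstract is not reproduced here.

```python
import math
from collections import Counter

# Hypothetical toy reference corpus; a real system would index a large collection.
REFERENCE_CORPUS = [
    "stock market prices fell sharply on monday",
    "shares and equities dropped as the market declined",
    "the football team won the championship game",
    "players celebrated the victory after the final match",
]


def tokenize(text):
    """Lowercase whitespace tokenizer (assumed for simplicity)."""
    return text.lower().split()


def retrieve(query, corpus, top_k=3):
    """Use the sentence as a query; return {doc_id: weight} for the top_k documents."""
    n = len(corpus)
    docs = [Counter(tokenize(d)) for d in corpus]
    # Document frequency of each term across the reference corpus.
    df = Counter(term for doc in docs for term in doc)
    scores = {}
    for doc_id, doc in enumerate(docs):
        score = sum(
            doc[t] * math.log(1 + n / df[t])
            for t in set(tokenize(query))
            if t in doc
        )
        if score > 0:
            scores[doc_id] = score
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_k])


def cosine(u, v):
    """Cosine similarity between two sparse weighting vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity(s1, s2, corpus, top_k=3):
    """Two sentences are similar if their returned document lists are similar."""
    return cosine(retrieve(s1, corpus, top_k), retrieve(s2, corpus, top_k))


def filter_novel(sentences, corpus, threshold=0.8, top_k=3):
    """Keep a sentence only if it is not too similar to any sentence kept so far."""
    kept = []
    for s in sentences:
        if all(sentence_similarity(s, k, corpus, top_k) < threshold for k in kept):
            kept.append(s)
    return kept
```

For example, `filter_novel(["market prices fell", "stock market fell", "team won the game"], REFERENCE_CORPUS)` drops the second sentence, because both finance sentences retrieve the same reference documents with the same weights, while the sports sentence retrieves a different list and is kept.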

Similar resources

Some Similarity Computation Methods in Novelty Detection

In the novelty task, the major challenge is the limited amount of information in a sentence that can be used in similarity computation. Several information expansion methods were introduced to tackle this problem. Our approach to relevance identification was to expand the information of a sentence with the context of this sentence using a sliding window method. The similarity was measured b...

Linear-Time Computation of Similarity Measures for Sequential Data

Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams...

Finding Topic-specific Strings in Text Categorization and Opinion Mining Contexts

In this paper, we present a new probabilistic method for automatically extracting topic-specific strings in a text categorization context. The advantage of this method is twofold. First, it allows us to automatically point out the expressions characterizing a specific topic category for a potential knowledge modelling. Second, it contributes to improve categorization results by providing to the...

Genetic Algorithm Based Text Categorization Using OLEX Method

The system describes a new similarity-based genetic algorithm (GA) and thresholding strategies (R&SCut variants). The GA was designed to give appropriate weights to terms according to their semantic content and importance by using their co-occurrence information and the discriminating power values for similarity computation. After investigating the existing common thresholding strategies, design mult...

Text Categorization with a Small Number of Labeled Training Examples

This thesis describes the investigation and development of supervised and semisupervised learning approaches to similarity-based text categorization systems. It uses a small number of manually labeled examples for training and still maintains effectiveness. The purpose of text categorization is to automatically assign arbitrary raw documents to predefined categories based on their contents. Tex...

Publication date: 2004
